1 A well-defined HDX-MS experiment

This vignette describes how to analyse time-resolved differential HDX-MS experiments. The key element is at least two conditions, e.g. apo vs. antibody-bound, apo vs. small molecule, or protein closed vs. protein open. The experiment can be replicated, though if sufficiently many time points are analysed (>=3) then significant results can occasionally be obtained even without replication. The data provided should be centroid-level data; this package does not yet support analysis directly from raw spectra. Typically the data will be provided as a .csv from tools such as DynamX or HDExaminer.

2 Main elements of the package

The package relies on Bioconductor infrastructure so that it integrates with other data types and can benefit from advances in other fields of mass spectrometry. There are package-specific objects, classes and methods, but importantly there is reuse of classes developed for quantitative proteomics data, mainly the QFeatures object, which extends the SummarizedExperiment class for mass-spectrometry data. The focus of this package is on testing and on visualisation of the testing results.

3 Data

We will begin with a structural-variant experiment in which MBP and a structural variant were mixed in different proportions. HDX-MS was performed on these samples and we expect to see reproducible but subtle differences. We first load the data, which are shipped with the package in .csv format.

MBPpath <- system.file("extdata", "MBP.csv", package = "hdxstats")

We can now read in the .csv file and have a quick look at its contents.

MBP <- read.csv(MBPpath)
head(MBP) # have a look
##   hx_sample pep_start pep_end pep_sequence pep_charge     d confidence  score
## 1       10%        19      30 VIWINGDKGYNG          2 2.120     medium 0.8686
## 2       10%        19      30 VIWINGDKGYNG          2 2.146     medium 0.8173
## 3       10%        19      30 VIWINGDKGYNG          2 2.143     medium 0.8839
## 4       10%        19      30 VIWINGDKGYNG          2    NA       <NA>     NA
## 5       10%        19      30 VIWINGDKGYNG          2    NA       <NA>     NA
## 6       10%        19      30 VIWINGDKGYNG          2    NA       <NA>     NA
##   hx_time time_unit replicate_cnt
## 1      30         s             1
## 2      30         s             2
## 3      30         s             3
## 4      30         s             4
## 5      30         s             5
## 6      30         s             6
length(unique(MBP$pep_sequence)) # peptide sequences
## [1] 115

Let us have a quick visualisation of some of the data so that we can see some of its features.

filter(MBP, pep_sequence == unique(MBP$pep_sequence[1]), pep_charge == 2) %>%
    ggplot(aes(x = hx_time, y = d, group = factor(replicate_cnt),
               color = factor(hx_sample,
                              unique(MBP$hx_sample)[c(7,5,1,2,3,4,6)]))) + 
    theme_classic() + geom_point(size = 2) + 
    scale_color_manual(values = brewer.pal(n = 7, name = "Set2")) + 
    labs(color = "experiment", x = "Deuterium Exposure", y = "Deuterium incorporation")
## Warning: Removed 96 rows containing missing values (geom_point).

We can see that the units of the time dimension are seconds and that deuterium incorporation has been normalized into Daltons.

4 Parsing to an object of class QFeatures

Working from a .csv is likely to cause issues downstream. Indeed, we run the risk of accidentally changing the data or corrupting the file in some way. Secondly, every .csv will be formatted slightly differently, so making extensible tools for these files would be inefficient. Furthermore, working with a generic class used in other mass-spectrometry fields can speed up analysis and the adoption of new methods. We will work with the QFeatures class from the QFeatures package, as it is a powerful and scalable way to store quantitative mass-spectrometry data.

Firstly, the data are stored in long format rather than wide format. We first switch the data to wide format.

MBP_wide <- pivot_wider(data.frame(MBP),
                        values_from = d,
                        names_from = c("hx_time", "replicate_cnt", "hx_sample"),
                        id_cols = c("pep_sequence", "pep_charge"))
head(MBP_wide)
## # A tibble: 6 x 198
##   pep_sequence pep_charge `30_1_10%` `30_2_10%` `30_3_10%` `30_4_10%` `30_5_10%`
##   <chr>             <int>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
## 1 VIWINGDKGYNG          2      2.12       2.15       2.14          NA         NA
## 2 VIWINGDKGYN~          2      2.12       2.12       2.11          NA         NA
## 3 GDKGYNGLAEVG          3      0.552      0.555      0.553         NA         NA
## 4 LAEVGKKFEKD~          4      2.41       2.36       2.42          NA         NA
## 5 AEVGKKFEKDT~          4      0.458      0.425      0.573         NA         NA
## 6 TVEHPDKL              3      1.43       1.44       1.44          NA         NA
## # ... with 191 more variables: 30_6_10% <dbl>, 30_7_10% <dbl>, 240_1_10% <dbl>,
## #   240_2_10% <dbl>, 240_3_10% <dbl>, 240_4_10% <dbl>, 240_5_10% <dbl>,
## #   240_6_10% <dbl>, 240_7_10% <dbl>, 1800_1_10% <dbl>, 1800_2_10% <dbl>,
## #   1800_3_10% <dbl>, 1800_4_10% <dbl>, 1800_5_10% <dbl>, 1800_6_10% <dbl>,
## #   1800_7_10% <dbl>, 14400_1_10% <dbl>, 14400_2_10% <dbl>, 14400_3_10% <dbl>,
## #   14400_4_10% <dbl>, 14400_5_10% <dbl>, 14400_6_10% <dbl>, 14400_7_10% <dbl>,
## #   30_1_15% <dbl>, 30_2_15% <dbl>, 30_3_15% <dbl>, 30_4_15% <dbl>, ...

We notice that there are many columns containing only NAs. The following code chunk removes these columns.

MBP_wide <- MBP_wide[, colSums(is.na(MBP_wide)) != nrow(MBP_wide)]

We also note that the column names are not very informative. We are going to format them in a very specific way so that later functions can automatically infer the design from the column names. We provide names in the format X(time)rep(replicate)cond(condition).

colnames(MBP_wide)[-c(1,2)]
##   [1] "30_1_10%"        "30_2_10%"        "30_3_10%"        "240_1_10%"      
##   [5] "240_2_10%"       "240_3_10%"       "1800_1_10%"      "1800_2_10%"     
##   [9] "1800_3_10%"      "14400_1_10%"     "14400_2_10%"     "14400_3_10%"    
##  [13] "30_1_15%"        "30_2_15%"        "30_3_15%"        "240_1_15%"      
##  [17] "240_2_15%"       "240_3_15%"       "1800_1_15%"      "1800_2_15%"     
##  [21] "1800_3_15%"      "14400_1_15%"     "14400_2_15%"     "14400_3_15%"    
##  [25] "30_1_20%"        "30_2_20%"        "30_3_20%"        "240_1_20%"      
##  [29] "240_2_20%"       "240_3_20%"       "1800_1_20%"      "1800_2_20%"     
##  [33] "1800_3_20%"      "14400_1_20%"     "14400_2_20%"     "14400_3_20%"    
##  [37] "30_1_25%"        "30_2_25%"        "30_3_25%"        "240_1_25%"      
##  [41] "240_2_25%"       "240_3_25%"       "1800_1_25%"      "1800_2_25%"     
##  [45] "1800_3_25%"      "14400_1_25%"     "14400_2_25%"     "14400_3_25%"    
##  [49] "30_1_5%"         "30_2_5%"         "30_3_5%"         "240_1_5%"       
##  [53] "240_2_5%"        "240_3_5%"        "1800_1_5%"       "1800_2_5%"      
##  [57] "1800_3_5%"       "14400_1_5%"      "14400_2_5%"      "14400_3_5%"     
##  [61] "30_1_W169G"      "30_2_W169G"      "30_3_W169G"      "240_1_W169G"    
##  [65] "240_2_W169G"     "240_3_W169G"     "1800_1_W169G"    "1800_2_W169G"   
##  [69] "1800_3_W169G"    "14400_1_W169G"   "14400_2_W169G"   "14400_3_W169G"  
##  [73] "30_1_WT Null"    "30_2_WT Null"    "30_3_WT Null"    "30_4_WT Null"   
##  [77] "30_5_WT Null"    "30_6_WT Null"    "30_7_WT Null"    "240_1_WT Null"  
##  [81] "240_2_WT Null"   "240_3_WT Null"   "240_4_WT Null"   "240_5_WT Null"  
##  [85] "240_6_WT Null"   "240_7_WT Null"   "1800_1_WT Null"  "1800_2_WT Null" 
##  [89] "1800_3_WT Null"  "1800_4_WT Null"  "1800_5_WT Null"  "1800_6_WT Null" 
##  [93] "1800_7_WT Null"  "14400_1_WT Null" "14400_2_WT Null" "14400_3_WT Null"
##  [97] "14400_4_WT Null" "14400_5_WT Null" "14400_6_WT Null" "14400_7_WT Null"
new.colnames <- gsub("0_", "0rep", paste0("X", colnames(MBP_wide)[-c(1,2)]))
new.colnames <- gsub("_", "cond", new.colnames)

# remove annoying % signs
new.colnames <- gsub("%", "", new.colnames)

# remove space (NULL could get confusing later and WT is clear)
new.colnames <- gsub(" .*", "", new.colnames)

new.colnames
##   [1] "X30rep1cond10"       "X30rep2cond10"       "X30rep3cond10"      
##   [4] "X240rep1cond10"      "X240rep2cond10"      "X240rep3cond10"     
##   [7] "X1800rep1cond10"     "X1800rep2cond10"     "X1800rep3cond10"    
##  [10] "X14400rep1cond10"    "X14400rep2cond10"    "X14400rep3cond10"   
##  [13] "X30rep1cond15"       "X30rep2cond15"       "X30rep3cond15"      
##  [16] "X240rep1cond15"      "X240rep2cond15"      "X240rep3cond15"     
##  [19] "X1800rep1cond15"     "X1800rep2cond15"     "X1800rep3cond15"    
##  [22] "X14400rep1cond15"    "X14400rep2cond15"    "X14400rep3cond15"   
##  [25] "X30rep1cond20"       "X30rep2cond20"       "X30rep3cond20"      
##  [28] "X240rep1cond20"      "X240rep2cond20"      "X240rep3cond20"     
##  [31] "X1800rep1cond20"     "X1800rep2cond20"     "X1800rep3cond20"    
##  [34] "X14400rep1cond20"    "X14400rep2cond20"    "X14400rep3cond20"   
##  [37] "X30rep1cond25"       "X30rep2cond25"       "X30rep3cond25"      
##  [40] "X240rep1cond25"      "X240rep2cond25"      "X240rep3cond25"     
##  [43] "X1800rep1cond25"     "X1800rep2cond25"     "X1800rep3cond25"    
##  [46] "X14400rep1cond25"    "X14400rep2cond25"    "X14400rep3cond25"   
##  [49] "X30rep1cond5"        "X30rep2cond5"        "X30rep3cond5"       
##  [52] "X240rep1cond5"       "X240rep2cond5"       "X240rep3cond5"      
##  [55] "X1800rep1cond5"      "X1800rep2cond5"      "X1800rep3cond5"     
##  [58] "X14400rep1cond5"     "X14400rep2cond5"     "X14400rep3cond5"    
##  [61] "X30rep1condW169G"    "X30rep2condW169G"    "X30rep3condW169G"   
##  [64] "X240rep1condW169G"   "X240rep2condW169G"   "X240rep3condW169G"  
##  [67] "X1800rep1condW169G"  "X1800rep2condW169G"  "X1800rep3condW169G" 
##  [70] "X14400rep1condW169G" "X14400rep2condW169G" "X14400rep3condW169G"
##  [73] "X30rep1condWT"       "X30rep2condWT"       "X30rep3condWT"      
##  [76] "X30rep4condWT"       "X30rep5condWT"       "X30rep6condWT"      
##  [79] "X30rep7condWT"       "X240rep1condWT"      "X240rep2condWT"     
##  [82] "X240rep3condWT"      "X240rep4condWT"      "X240rep5condWT"     
##  [85] "X240rep6condWT"      "X240rep7condWT"      "X1800rep1condWT"    
##  [88] "X1800rep2condWT"     "X1800rep3condWT"     "X1800rep4condWT"    
##  [91] "X1800rep5condWT"     "X1800rep6condWT"     "X1800rep7condWT"    
##  [94] "X14400rep1condWT"    "X14400rep2condWT"    "X14400rep3condWT"   
##  [97] "X14400rep4condWT"    "X14400rep5condWT"    "X14400rep6condWT"   
## [100] "X14400rep7condWT"
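As a sanity check, the design encoded in these names can be recovered with a regular expression. The following is a minimal sketch; the helper `parseDesign` is ours, for illustration only, and not part of hdxstats:

```r
# split names of the form X(time)rep(replicate)cond(condition)
# back into their three components
parseDesign <- function(nms) {
  data.frame(
    time = as.numeric(sub("^X(\\d+)rep.*$", "\\1", nms)),
    replicate = as.integer(sub("^X\\d+rep(\\d+)cond.*$", "\\1", nms)),
    condition = sub("^.*cond", "", nms)
  )
}

parseDesign(c("X30rep1cond10", "X14400rep3condW169G"))
#    time replicate condition
# 1    30         1        10
# 2 14400         3     W169G
```

This mirrors the inference that later package functions perform automatically from the column names.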

We will now parse the data into an object of class QFeatures; we have provided a function in the package to assist with this. If you want to do this yourself, use the readQFeatures function from the QFeatures package.

MBPqDF <- parseDeutData(object = DataFrame(MBP_wide),
                        design = new.colnames,
                        quantcol = 3:102)

5 Heatmap visualisations of HDX data

To help us get used to the QFeatures object, we show how to generate a heatmap of these data:

pheatmap(t(assay(MBPqDF)),
         cluster_rows = FALSE, 
         cluster_cols = FALSE,
         color = brewer.pal(n = 9, name = "BuPu"),
         main = "Structural variant deuterium incorporation heatmap", 
         fontsize = 14,
         legend_breaks = c(0, 2, 4, 6, 8, 10, 12, max(assay(MBPqDF))),
         legend_labels = c("0", "2", "4", "6", "8","10", "12", "Incorporation"))

If you prefer to have the start-to-end residue numbers in the heatmap instead you can change the plot as follows:

regions <- unique(MBP[,c("pep_start", "pep_end")])
xannot <- paste0("[", regions[,1], ",", regions[,2], "]")
pheatmap(t(assay(MBPqDF)),
         cluster_rows = FALSE, 
         cluster_cols = FALSE,
         color = brewer.pal(n = 9, name = "BuPu"),
         main = "Structural variant deuterium incorporation heatmap", 
         fontsize = 14,
         legend_breaks = c(0, 2, 4, 6, 8, 10, 12, max(assay(MBPqDF))),
         legend_labels = c("0", "2", "4", "6", "8","10", "12", "Incorporation"),
         labels_col = xannot)

It may be useful to normalize HDX-MS data for interpretation or visualization purposes. We can normalize by the number of exchangeable amides or by using back-exchange correction values. We first use percentage incorporation as the normalisation and visualise the result as a heatmap.
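For intuition, percentage incorporation divides each uptake value by the peptide's theoretical maximum uptake. Below is a rough, illustrative way to count exchangeable backbone amides: peptide length minus the N-terminal residue and minus prolines, which carry no backbone amide hydrogen. Conventions vary, and the package's exchangeableAmides function may use a different rule; the helper `countAmides` here is ours, not from hdxstats.

```r
# illustrative count of exchangeable backbone amides:
# all residues except the N-terminal one, excluding prolines
countAmides <- function(seq) {
  nchar(seq) - 1 - lengths(regmatches(seq, gregexpr("P", seq)))
}

countAmides("TVEHPDKL")  # 8 residues, 1 proline -> 6
```

A percentage-incorporation value for an uptake d would then simply be d / countAmides(seq).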

MBPqDF_norm1 <- normalisehdx(MBPqDF,
                             sequence = unique(MBP$pep_sequence),
                             method = "pc")


pheatmap(t(assay(MBPqDF_norm1)),
         cluster_rows = FALSE, 
         cluster_cols = FALSE,
         color = brewer.pal(n = 9, name = "BuPu"),
         main = "Structural variant deuterium incorporation heatmap, normalised", 
         fontsize = 14,
         legend_breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2),
         legend_labels = c("0", "0.2", "0.4", "0.6", "0.8","1", "Incorporation"),
         labels_col = xannot)

Now, we demonstrate a back-exchange correction calculation. The back-exchange values are fictitious, but the code chunk below demonstrates how to set this up.

# made-up correction factor
correction <- (exchangeableAmides(unique(MBP$pep_sequence)) + 1) * 0.9

MBPqDF_norm2 <- normalisehdx(MBPqDF,
                             sequence = unique(MBP$pep_sequence),
                             method = "bc", 
                             correction = correction)


pheatmap(t(assay(MBPqDF_norm2)),
         cluster_rows = FALSE, 
         cluster_cols = FALSE,
         color = brewer.pal(n = 9, name = "BuPu"),
         main = "Structural variant deuterium incorporation heatmap, normalised", 
         fontsize = 14,
         legend_breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2),
         legend_labels = c("0", "0.2", "0.4", "0.6", "0.8","1", "Incorporation"),
         labels_col = xannot)

6 Functional data analysis of HDX-MS data

The hdxstats package uses an empirical Bayes functional approach to analyse the data. We explain this approach in steps. First, we fit a parametric model to the data; this will allow us to explore the HdxStatModel class.

res <- differentialUptakeKinetics(object = MBPqDF[,1:100], # provide a QFeatures object
                                  feature = rownames(MBPqDF)[[1]][37], # which peptide we fit
                                  start = list(a = NULL, b = 0.0001,  d = NULL, p = 1)) # starting parameter guesses
## Warning in differentialUptakeKinetics(object = MBPqDF[, 1:100], feature =
## rownames(MBPqDF)[[1]][37], : NAs introduced by coercion

Here, we see the HdxStatModel class: a Functional Model was applied to the data and a total of 7 models were fitted.

res
## Object of class "HdxStatModel"
## Method: Functional Model 
## Fitted 7

The nullmodel and alternative slots of an instance of HdxStatModel provide the underlying fitted models. The method and formula slots provide vital information about what analysis was performed. The vis slot provides a ggplot object so that we can visualise the functional fits.

res@vis

We can also explore functional models other than the Weibull; for example, a model that is the sum of two exponentials.

res2 <- differentialUptakeKinetics(object = MBPqDF[,1:100], # provide a QFeatures object
                                  formula = value ~ a*(1 - exp(-b * timepoint)) + c * (1 - exp(-d * timepoint)),
                                  feature = rownames(MBPqDF)[[1]][37], # which peptide we fit
                                  start = list(a = 0, b = 0.0001,  c = 0, d = 0.0001)) # starting parameter guesses
## Warning in differentialUptakeKinetics(object = MBPqDF[, 1:100], formula = value
## ~ : NAs introduced by coercion

We can then visualize the output of our curve fitting:

res2@vis

One question is whether we can statistically say that one model is better than another. We can do this by computing the likelihood ratio between the two proposed models. We can see that the likelihoods from the second model are generally higher.

logLik(res)
##       null       alt1       alt2       alt3       alt4       alt5       alt6 
## -43.887971  13.126180  11.888973  14.550056  16.637855  12.387702   9.762426 
##       alt7 
##  29.610175
logLik(res2)
##      null      alt1      alt2      alt3      alt4      alt5      alt6      alt7 
## -41.95843  23.54976  22.10180  25.32149  23.77991  22.07919  26.54933  52.04347

Let us compute minus twice the difference of the log-likelihoods:

diff <- -2 * (logLik(res) - logLik(res2))

Wilks' theorem tells us that this statistic is asymptotically distributed as chi-squared, with degrees of freedom equal to the difference between the degrees of freedom of the two models. Each model here has 4 degrees of freedom, so in this case there is no need to apply the asymptotic result: the second model is preferred, as it has the higher log-likelihood.
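Had the two models differed in their number of free parameters, the likelihood-ratio statistic could be compared against the chi-squared distribution directly. A sketch with illustrative numbers (the statistic of 4.6 and the 1 degree of freedom are made up, not taken from the fits above):

```r
# p-value for a likelihood-ratio test via Wilks' theorem
stat <- 4.6      # -2 * (logLik(null) - logLik(alternative)), illustrative
df_diff <- 1     # difference in number of free parameters, illustrative
pchisq(stat, df = df_diff, lower.tail = FALSE)  # ~0.032
```

A small p-value would indicate that the improvement in fit from the extra parameters is unlikely to be due to chance alone.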